New Frameworks for Offline and Streaming Coreset Constructions

Authors

  • Vladimir Braverman
  • Dan Feldman
  • Harry Lang
Abstract

Let P be a set (called points), Q be a set (called queries) and f : P × Q → [0, ∞) be a function (called cost). For an error parameter ε > 0, a set S ⊆ P with a weight function w : P → [0, ∞) is an ε-coreset if ∑_{s∈S} w(s) f(s, q) approximates ∑_{p∈P} f(p, q) up to a multiplicative factor of 1 ± ε for every given query q ∈ Q. Coresets are used to solve fundamental problems in machine learning of streaming and distributed data. We construct coresets for the k-means clustering of n input points, both in an arbitrary metric space and d-dimensional Euclidean space. For Euclidean space, we present the first coreset whose size is simultaneously independent of both d and n. In particular, this is the first coreset of size o(n) for a stream of n sparse points in a d ≥ n dimensional space (e.g. adjacency matrices of graphs). We also provide the first generalizations of such coresets for handling outliers. For arbitrary metric spaces, we improve the dependence on k to k log k and present a matching lower bound. For M-estimator clustering (special cases include the well-known k-median and k-means clustering), we introduce a new technique for converting an offline coreset construction to the streaming setting. Our method yields streaming coreset algorithms requiring the storage of O(S + k log n) points, where S is the size of the offline coreset. In comparison, the previous state-of-the-art was the merge-and-reduce technique that required O(S log n) points, where a is the exponent in the offline construction's dependence on ε⁻¹. For example, combining our offline and streaming results, we produce a streaming metric k-means coreset algorithm using O(ε⁻² k log k log n) points of storage. The previous state-of-the-art required O(ε⁻⁴ k log k log n) points.

∗ Department of Computer Science, Johns Hopkins University. This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639, by the Google Faculty Award and by DARPA grant N660001-1-2-4014. Its contents are solely the responsibility of the authors and do not represent the official view of DARPA or the Department of Defense.
† Department of Computer Science, University of Haifa.
‡ Department of Mathematics, Johns Hopkins University. This material is based upon work supported by the Franco-American Fulbright Commission.

arXiv:1612.00889v1 [cs.DS] 2 Dec 2016
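The ε-coreset condition stated in the abstract can be illustrated with a small numerical check. The sketch below is not the paper's construction: it only evaluates the definition for the k-means cost on a finite list of queries, whereas the definition quantifies over every q ∈ Q. The function names (`kmeans_cost`, `coreset_error`, `holds_on_queries`) are hypothetical and chosen for this illustration.

```python
import math

def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans_cost(point, centers):
    """f(p, q) for k-means: squared distance from p to its nearest center in q."""
    return min(sq_dist(point, c) for c in centers)

def coreset_error(points, coreset, weights, query):
    """Relative gap between the weighted coreset cost and the full cost."""
    full = sum(kmeans_cost(p, query) for p in points)
    approx = sum(w * kmeans_cost(s, query) for s, w in zip(coreset, weights))
    if full == 0:
        # A zero full cost can only be matched by a zero coreset cost.
        return 0.0 if approx == 0 else math.inf
    return abs(approx - full) / full

def holds_on_queries(points, coreset, weights, queries, eps):
    """True if the 1 ± eps guarantee holds on each query in the finite list.

    This is a necessary check only; it does not prove the guarantee for all q.
    """
    return all(coreset_error(points, coreset, weights, q) <= eps for q in queries)

# Two points summarized by one point of weight 2: exact for a center at the
# midpoint, but off by a factor of 2 for a center placed on the dropped point.
points = [(0.0, 0.0), (2.0, 0.0)]
coreset, weights = [(0.0, 0.0)], [2.0]
print(coreset_error(points, coreset, weights, [(1.0, 0.0)]))  # 0.0
print(coreset_error(points, coreset, weights, [(2.0, 0.0)]))  # 1.0
```

The second query shows why a valid coreset must work for every query at once: a summary that is exact for one set of centers can be badly wrong for another.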

Similar resources

StreamKM++: A Clustering Algorithm for Data Streams

We develop a new k-means clustering algorithm for data streams, which we call StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm [1]. To compute the small sample, we propose two new techniques. First, we use a non-uniform sampling approach similar to the k-means++ seeding procedure to obtain small core...

Clustering High Dimensional Dynamic Data Streams

We present data streaming algorithms for the k-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space {1, 2, ..., Δ}^d. Our algorithms use k ε⁻² poly(d log Δ) space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of k centers the cost of...

StreamKM++: A Clustering Algorithm for Data Streams

We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, non-u...

Clustering Problems on Sliding Windows

We explore clustering problems in the streaming sliding window model in both general metric spaces and Euclidean space. We present the first polylogarithmic space O(1)-approximation to the metric k-median and metric k-means problems in the sliding window model, answering the main open problem posed by Babcock, Datar, Motwani and O'Callaghan [5], which has remained unanswered for over a decade. O...

Random Projections for k-Means: Maintaining Coresets Beyond Merge & Reduce

We give a new construction for a small space summary satisfying the coreset guarantee of a data set with respect to the k-means objective function. The number of points required in an offline construction is in Õ(k ε⁻² min(d, k ε⁻²)) which is minimal among all available constructions. Aside from two constructions with exponential dependence on the dimension, all known coresets are maintained in d...

Journal title:
  • CoRR

Volume abs/1612.00889

Publication date 2016